Aim

Preliminary analysis to inform variables for inclusion in the Bayesian occupancy model. Candidate variables include:

Occupancy variables:

Note, X, Y, are also available. Kebele is available but as not all kebeles are sampled this would need to be a random effect.

Observation variables:

Note, further observation variables (canopy, understory, etc.) were collected but had too many missing variables to include in this instance. A more detailed habitat_type is also available.

This analysis will:

  1. briefly summarise data
  2. preview variables (plot bivariate relationships, correlations)
  3. GLM analysis (stepwise selection using AIC)
  4. RF analysis
  5. GLM analysis with RF selected variables (stepwise selection using AIC)
  6. MuMIn from top 1% of models identified from dredge of full model
  7. Summarise variables for inclusion in occupancy model

Data

Data are all cleaned and prepared in the sourced R script:

source("/Users/elaw/Desktop/LeuphanaProject/BirdModelling/ETH_birds/R/GLMandRFanalysis/GLMandRFanalysis_Data_v2.R")

This includes separate farm and forest data frames. All sites have 2 rounds of sampling. Farm data have 80 sites, and Forest has 65 sites. Start time and date were converted to numeric before being centered and scaled. All other continuous variables (including ordinal/count variables n_observers, visibility, and cloud, but excluding X,Y) will need to be scaled and centered (after transformation).

frmSR %>% summary()
##    point_id         round        SR         habitat     n_observers   
##  Length:160         1:80   Min.   : 2.000   FARM:160   Min.   :1.000  
##  Class :character   2:80   1st Qu.: 5.750   FOR :  0   1st Qu.:1.000  
##  Mode  :character          Median : 8.000              Median :1.000  
##                            Mean   : 7.969              Mean   :1.156  
##                            3rd Qu.:10.000              3rd Qu.:1.000  
##                            Max.   :18.000              Max.   :3.000  
##  recording     start               date             visibility   
##  0: 11     Min.   :-1.86057   Min.   :-1.784218   Min.   :0.000  
##  1:149     1st Qu.:-0.85994   1st Qu.:-0.681072   1st Qu.:2.000  
##            Median :-0.20436   Median : 0.095926   Median :2.000  
##            Mean   :-0.05453   Mean   : 0.001199   Mean   :2.062  
##            3rd Qu.: 0.81007   3rd Qu.: 1.016812   3rd Qu.:3.000  
##            Max.   : 1.93491   Max.   : 1.515626   Max.   :5.000  
##      cloud          kebele          hab_details              X        
##  Min.   :0.000   Length:160         Length:160         Min.   :36.05  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:36.23  
##  Median :2.000   Mode  :character   Mode  :character   Median :36.40  
##  Mean   :1.656                                         Mean   :36.34  
##  3rd Qu.:3.000                                         3rd Qu.:36.42  
##  Max.   :3.000                                         Max.   :36.47  
##        Y           elevation        slope              twi         farm_type
##  Min.   :7.569   Min.   :1555   Min.   : 0.3981   Min.   : 4.648   0: 20    
##  1st Qu.:7.874   1st Qu.:1774   1st Qu.: 6.0388   1st Qu.: 5.462   1:140    
##  Median :7.923   Median :2024   Median : 8.6846   Median : 6.035            
##  Mean   :7.908   Mean   :2018   Mean   :10.3337   Mean   : 6.648            
##  3rd Qu.:8.009   3rd Qu.:2236   3rd Qu.:13.7028   3rd Qu.: 7.447            
##  Max.   :8.059   Max.   :2439   Max.   :30.9040   Max.   :13.850            
##     sidi1ha           sidi200           pwv1ha           pwv500       
##  Min.   :0.00000   Min.   :0.0919   Min.   : 0.000   Min.   : 0.6119  
##  1st Qu.:0.06679   1st Qu.:0.3095   1st Qu.: 0.000   1st Qu.: 5.8891  
##  Median :0.17883   Median :0.5165   Median : 0.000   Median : 9.2416  
##  Mean   :0.22413   Mean   :0.4576   Mean   : 2.346   Mean   :17.3681  
##  3rd Qu.:0.34480   3rd Qu.:0.6014   3rd Qu.: 1.543   3rd Qu.:31.3321  
##  Max.   :0.66663   Max.   :0.7683   Max.   :27.160   Max.   :61.0962  
##      fl_dis           fl_dis85      
##  Min.   :  30.00   Min.   :   0.00  
##  1st Qu.:  89.13   1st Qu.:  72.28  
##  Median : 189.87   Median : 370.90  
##  Mean   : 297.61   Mean   : 466.88  
##  3rd Qu.: 418.39   3rd Qu.: 733.33  
##  Max.   :1210.17   Max.   :1960.87
fstSR %>% summary()
##    point_id         round        SR         habitat     n_observers   
##  Length:130         1:65   Min.   : 2.000   FARM:  0   Min.   :1.000  
##  Class :character   2:65   1st Qu.: 5.000   FOR :130   1st Qu.:1.000  
##  Mode  :character          Median : 7.000              Median :1.000  
##                            Mean   : 7.485              Mean   :1.177  
##                            3rd Qu.:10.000              3rd Qu.:1.000  
##                            Max.   :13.000              Max.   :3.000  
##  recording     start               date             visibility   
##  0:  0     Min.   :-1.86057   Min.   :-1.784218   Min.   :0.000  
##  1:130     1st Qu.:-0.75643   1st Qu.:-0.661887   1st Qu.:2.000  
##            Median :-0.06634   Median : 0.000000   Median :2.000  
##            Mean   : 0.06711   Mean   :-0.001476   Mean   :1.915  
##            3rd Qu.: 0.82387   3rd Qu.: 0.824961   3rd Qu.:2.000  
##            Max.   : 2.55598   Max.   : 1.592366   Max.   :3.000  
##      cloud          kebele          hab_details              X        
##  Min.   :0.000   Length:130         Length:130         Min.   :36.06  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:36.10  
##  Median :2.000   Mode  :character   Mode  :character   Median :36.20  
##  Mean   :1.469                                         Mean   :36.24  
##  3rd Qu.:2.000                                         3rd Qu.:36.36  
##  Max.   :3.000                                         Max.   :36.48  
##        Y           elevation        slope             twi         forest_type
##  Min.   :7.570   Min.   :1578   Min.   : 2.142   Min.   : 4.354   0:38       
##  1st Qu.:7.623   1st Qu.:1776   1st Qu.: 9.578   1st Qu.: 5.170   1:92       
##  Median :7.683   Median :1943   Median :12.264   Median : 5.683              
##  Mean   :7.788   Mean   :1980   Mean   :14.235   Mean   : 6.230              
##  3rd Qu.:7.990   3rd Qu.:2083   3rd Qu.:19.426   3rd Qu.: 7.023              
##  Max.   :8.060   Max.   :2547   Max.   :31.282   Max.   :11.287              
##      fr_dis          fr_dis85          hli            pwv2km     
##  Min.   : 10.00   Min.   :  0.0   Min.   :1.053   Min.   :10.26  
##  1st Qu.: 82.46   1st Qu.:  0.0   1st Qu.:1.110   1st Qu.:57.28  
##  Median :219.54   Median :114.0   Median :1.132   Median :77.89  
##  Mean   :274.87   Mean   :211.5   Mean   :1.129   Mean   :70.01  
##  3rd Qu.:435.66   3rd Qu.:325.6   3rd Qu.:1.150   3rd Qu.:86.12  
##  Max.   :910.22   Max.   :750.0   Max.   :1.192   Max.   :93.87
frmSR <- frmSR %>% 
  mutate(habitat2 = map_chr(hab_details, ~ ifelse(.x %>% str_starts("F"), "CR", .x))) 

fstSR <- fstSR %>% 
  mutate(habitat2 = map_chr(hab_details, ~ ifelse(.x %>% str_starts("C"), "CF", "NCF"))) 

Plots

Bivariate relationships with SR

Bivariate relationships are shown with loess (y~x, red), linear (y~x, blue) and quadratic (y~poly(x,2), yellow) lines, or boxplots where appropriate.

Farmland

Farmland bivariate relationships for occupancy variables

Farmland bivariate relationships for occupancy variables

Discussion of Farmland occupancy variables:

  • farm_type: unbalanced between 0:1. Median is higher but less spread for new farms (expected).
  • elevation: reasonably normal, and shows a small, negative relationship with SR (expected). No strong indication of the need for a quadratic term, but it could help extrapolation.
  • slope: skewed right, with a small, negative relationship with SR (?expected given influence on plants/farms?). No strong indication of the need for a quadratic term, but it could help extrapolation.
  • twi: skewed right, with a strong positive relationship with SR (4-5 spp difference between extremes)(?expected given influence plants/farms). No strong indication of the need for a quadratic term, but it could help extrapolation. Seems to be enough data to support the trend across the data.
  • sidi1ha: skewed right, with a small, positive relationship with SR (expected). No strong indication of the need for a quadratic term, but it could help extrapolation.
  • sidi200: reasonably normal, with a small positive linear relationship with SR (expected). Minor indication of a quadratic (highest in mid-ranges) but questionable levels of data on the extremes to support this, but it could help extrapolation.
  • pwv1ha: extremely right skewed with many observations at or near zero, with a small positive linear relationship with SR (expected). No indication of quadratic.
  • pwv500: right skewed, with a small positive linear relationship with SR (expected). No indication of quadratic.
  • fl_dis: right skewed, with a small negative linear relationship with SR (expected). Quadratic follows the loess line, but not well supported at the upper end.
  • fl_dis85: right skewed, with a small negative linear relationship with SR (expected). No indication of quadratic.

We may want to be careful with pwv1ha (due to skew, although relationship is our expected direction here), and twi (because of skew and strength of the relationship that we don’t really expect).

We may want to consider transforming slope, twi, sidi1ha, pwv500, fl_dis, and fl_dis85.

Farmland bivariate relationships for observation variables

Farmland bivariate relationships for observation variables

Discussion of Farmland observation variables:

  • round: larger median and higher spread in second round.
  • start time: reasonably normal, quadratic (highest in mid-range) relationship indicated, although positive linear is also reasonable.
  • date: reasonable spread, slight positive relationship with no strong indication of quadratic.
  • n_observers: skewed to one observer, and shows a negative linear relationship (with inconsistent medians). Suggest removing.
  • recording: most observations included recording, which had a larger spread. Higher median in no recording - not expected, suggest tentative removal.
  • visibility: positive trend, with the inclusion of the highest observation. Consider making visibility binary. As breaking between 0 and 1 (here, -2) would result in unbalanced data, breaking visibility between 1 and 2 (here >=0) results in a positive trend with increasing variability and reasonable spread.
  • cloud: reasonable spread, no consistent trend, suggest removal.

Forest

Forest bivariate relationships for occupancy variables

Forest bivariate relationships for occupancy variables

Discussion of forest occupancy variables:

  • forest_type: higher median in primary forests, same spread.
  • elevation: skewed right, strong positive relationship. Quadratic not advised based on this.
  • slope: reasonably normal, slight negative relationship. Quadratic not advised.
  • twi: skewed right, slight positive relationship. Not enough data to support quadratic in upper ranges.
  • fr_dis: skewed right, slight positive relationship. Not enough data to support quadratic in upper ranges.
  • fr_dis85: skewed right with elevated zeros, slight positive relationship. No strong support for quadratic but it could help extrapolation.
  • hli: relatively normal, no strong relationship with SR. Quadratic not advised.
  • pwv2km: skewed left, positive relationship with SR. Quadratic not advised.

Consider transforming elevation, twi, distance variables, and pwv2km.

Forest bivariate relationships for observation variables

Forest bivariate relationships for observation variables

Discussion of farmland observation variables:

  • round: higher in round 2.
  • start: slight negative linear relationship with SR, quadratic probably better to capture higher mid-ranges.
  • date: strong positive relationship with SR. Quadratic might be reasonable here, but data not thick enough in extremes.
  • n_observers: negative relationship with SR, possibly expected in this case as likely more sensitive species, but with low numbers in >1 observer, and a similar spread.
  • recording: was always made.
  • visibility: slight positive relationship, but not consistent in the medians. Consider making binary, breaking at 1-2 as in Farmland.
  • cloud: slight consistent, negative relationship, with no support for a quadratic. Consider making binary, to avoid issues with the ordinal variable, breaking in the middle gives consistent categories and captures the main trend.

Transforming variables

Variables we’d like to consider transforming include:

  • Farmland occupancy: slope, twi, sidi1ha, pwv500, fl_dis, and fl_dis85
  • Forest occupancy: elevation, twi, distance variables, and pwv2km
  • Farmland observation: visibility (make binary at break 1-2)
  • Forest observation: n_observers (make binary at break 1-2), visibility (make binary at break 1-2), cloud (make binary at break 1-2)

Potential transformations assessed using bestNormalize::bestNormalize, preferring conventional transformations where these were close to optimal

Apply transformations:

frmSR <- frmSR %>% 
  mutate(
    slope = sqrt(slope),
    twi = log10(twi+1),
    sidi1ha = sqrt(sidi1ha),
    pwv500 = log10(pwv500+1),
    fl_dis = sqrt(fl_dis),
    fl_dis85 = sqrt(fl_dis85),
    visibility = visibility > 1
  )

fstSR <- fstSR %>% 
  mutate(
    elevation = log10(elevation+1),
    twi = log10(twi+1),
    fr_dis = sqrt(fr_dis),
    fr_dis85 = sqrt(fr_dis85),
    n_observers = n_observers > 1,
    visibility = visibility > 1,
    cloud = cloud > 1
  )
Transformed farmland vars:
Farmland bivariate relationships for transformed variables

Farmland bivariate relationships for transformed variables

Discussion: all of these suggest a linear trend (not enough difference/data to support curves at extremes). twi remains quite skewed and should be treated with caution.

Transformed forest vars:
Forest bivariate relationships for transformed variables

Forest bivariate relationships for transformed variables

Discussion:

  • elevation retains positive relationship with SR, possibly due to reduced disturbance? Quadratic not advised here.
  • twi shows positive relationship, which may taper. Consider quadratic to capture this, but not well supported in the extremes.
  • fr_dis shows positive relationship, which could include quadratic but perhaps only to improve extrapolation.
  • fr_dis85 is positive, quadratic not advised.

Center - scale all continuous variables in both datasets

All other continuous variables (excluding binary, X,Y) will need to be scaled and centrered in order to improve comparability in the GLMs. Also need to remove those we identified as being problematic, to prevent their further use.

frmSR <- frmSR %>% 
  # select(-pwv1ha, -twi) %>% 
  select(-n_observers, -recording, -cloud) %>% 
  mutate(across(c(start, date, elevation, slope, twi, sidi1ha, sidi200, pwv1ha, pwv500, fl_dis, fl_dis85),
         ~scale(.x)[,1]))

fstSR <- fstSR %>% 
  select(-recording) %>% 
  mutate(across(c(start, date, elevation, slope, twi, fr_dis, fr_dis85, hli, pwv2km),
         ~scale(.x)[,1]))

Correlation plots

These show Pearson correlation using corrplot::corrplot

Farmland

Farmland variables - Pearson correlation

Farmland variables - Pearson correlation

Discussion of farmland correlations:

  • round and date are highly correlated. Suggest we include date as it carries more information.
  • elevation and sidi200 have pearson correlation of -0.6… remove sidi200 as elevation is our metric of climate change (correlated with temerature)
  • pwv500m and fl_dis have pearson correlation of -0.74 … remove pwv500 as fl_dis has better spread.
  • twi and slope have pearson correlation of -0.6 … remove twi as slope has better spread

Suggest removing round, sidi200, pwv500, twi

Forest

Forest variables - Pearson correlation

Forest variables - Pearson correlation

Discussion:

  • round and date have pearson correlation of 0.84. Suggest we include date as it carries more information.
  • elevation and fr_dis85 have pearson correlation of 0.63. Keep elevation as it is our climate change metric.
  • forest_type and fr_dis85 have pearson correlation of 0.73. Keep forest type, because of above
  • fr_dis and fr_dis85 have pearson correlation of 0.61. Keep fr_dis, because of above.
  • pwv2km and forest_type have pearson correlation of 0.65. keep forest_type because of above.
  • pwv2km and fr_dis have pearson correlation of 0.69. Keep fr_dis because of above.

Suggest removing round, fr_dis85, and pwv2km. Possibly also twi because of uncertain trend/spread.

Summary of the variables

Farmland

Variables: bold are retained

  • elevation (optional remove due to correlation with sidi200)
  • slope (sqrt)
  • twi (log10(x+1), remove due to correlation with slope and poor spread)
  • farm_type
  • sidi1ha(sqrt)
  • sidi200 (remove due to correlation with elevation)
  • pwv1ha (remove due to extreme skew)
  • pwv500 (log10(x+1), remove due to correlation)
  • fl_dis (sqrt)
  • fl_dis85 (sqrt, remove due to correlation)
  • round (remove due to correlation)
  • start (quadratic)
  • date
  • n_observers (remove, non-informative),
  • recording (remove, unbalanced and non-informative)
  • visibility (transform to binary)
  • cloud (remove, non-informative)

Forest

Variables: bold are retained Suggest removing round, fr_dis85, and pwv2km.

  • elevation (log10(x+1))
  • slope
  • twi (log10(x+1), remove due to correlation with slope and poor spread)
  • forest_type
  • fr_dis (sqrt)
  • fr_dis85 (sqrt, remove due to correlation)
  • hli
  • pwv2km (remove due to correlation)
  • round (remove due to correlation)
  • start (quadratic)
  • date
  • n_observers (transformed to binary)
  • recording (removed, uninformative)
  • visibility (transformed to binary)
  • cloud (transformed to binary)

GLM analyses

Starting from the full set of variables preselected above, conduct backwards step selection using AIC. Initial trial with poisson with log link (glm(family=poisson)) moving to negative binomial (glmmTMB(family=negbin2)) if over dispersed.

# Full farm formula
fffrm <- formula(SR ~ #habitat2 +
                  elevation + slope + farm_type + sidi1ha + fl_dis + 
                  poly(start,2) + date + visibility)

# Full forest formula
fffst <- formula(SR ~ #habitat2 + 
                  elevation + slope + forest_type + fr_dis + hli + 
                  poly(start,2) + date + n_observers + visibility + cloud)

Farm GLM

m0frm <- glm(formula = fffrm, 
             family = poisson,
             data = frmSR)
AER::dispersiontest(m0frm) # ok
## 
##  Overdispersion test
## 
## data:  m0frm
## z = 0.068533, p-value = 0.4727
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##    1.00724
summary(m0frm)
## 
## Call:
## glm(formula = fffrm, family = poisson, data = frmSR)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.73926  -0.75673  -0.07539   0.62989   2.52268  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.29952    0.09750  23.585  < 2e-16 ***
## elevation       -0.04928    0.03364  -1.465  0.14300    
## slope           -0.03453    0.02917  -1.183  0.23663    
## farm_type1      -0.23225    0.08079  -2.875  0.00404 ** 
## sidi1ha          0.03321    0.03137   1.059  0.28971    
## fl_dis          -0.04836    0.03216  -1.504  0.13270    
## poly(start, 2)1  0.39163    0.40079   0.977  0.32849    
## poly(start, 2)2 -2.16105    0.43050  -5.020 5.17e-07 ***
## date             0.09698    0.03223   3.009  0.00262 ** 
## visibilityTRUE  -0.06135    0.08001  -0.767  0.44320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 230.75  on 159  degrees of freedom
## Residual deviance: 165.66  on 150  degrees of freedom
## AIC: 799.77
## 
## Number of Fisher Scoring iterations: 4
m0frm.s <- MASS::stepAIC(m0frm)
## Start:  AIC=799.77
## SR ~ elevation + slope + farm_type + sidi1ha + fl_dis + poly(start, 
##     2) + date + visibility
## 
##                  Df Deviance    AIC
## - visibility      1   166.24 798.36
## - sidi1ha         1   166.78 798.90
## - slope           1   167.06 799.18
## <none>                165.66 799.77
## - elevation       1   167.81 799.92
## - fl_dis          1   167.93 800.04
## - farm_type       1   173.55 805.67
## - date            1   174.77 806.89
## - poly(start, 2)  2   192.66 822.78
## 
## Step:  AIC=798.36
## SR ~ elevation + slope + farm_type + sidi1ha + fl_dis + poly(start, 
##     2) + date
## 
##                  Df Deviance    AIC
## - sidi1ha         1   167.47 797.58
## - slope           1   167.71 797.82
## <none>                166.24 798.36
## - fl_dis          1   168.28 798.39
## - elevation       1   168.47 798.58
## - farm_type       1   174.05 804.16
## - date            1   174.80 804.92
## - poly(start, 2)  2   193.40 821.52
## 
## Step:  AIC=797.58
## SR ~ elevation + slope + farm_type + fl_dis + poly(start, 2) + 
##     date
## 
##                  Df Deviance    AIC
## - fl_dis          1   169.46 797.58
## <none>                167.47 797.58
## - slope           1   169.78 797.89
## - elevation       1   171.09 799.21
## - date            1   175.80 803.92
## - farm_type       1   176.70 804.81
## - poly(start, 2)  2   196.05 822.17
## 
## Step:  AIC=797.58
## SR ~ elevation + slope + farm_type + poly(start, 2) + date
## 
##                  Df Deviance    AIC
## <none>                169.46 797.58
## - slope           1   171.86 797.98
## - elevation       1   177.22 803.34
## - date            1   178.30 804.42
## - farm_type       1   179.20 805.32
## - poly(start, 2)  2   198.38 822.49
AER::dispersiontest(m0frm.s) # ok
## 
##  Overdispersion test
## 
## data:  m0frm.s
## z = 0.217, p-value = 0.4141
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.023231
summary(m0frm.s)  
## 
## Call:
## glm(formula = SR ~ elevation + slope + farm_type + poly(start, 
##     2) + date, family = poisson, data = frmSR)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.55309  -0.73695  -0.02547   0.65748   2.57939  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.27265    0.07283  31.206  < 2e-16 ***
## elevation       -0.08094    0.02899  -2.792  0.00524 ** 
## slope           -0.04407    0.02849  -1.547  0.12196    
## farm_type1      -0.25424    0.07924  -3.208  0.00133 ** 
## poly(start, 2)1  0.23528    0.39088   0.602  0.54723    
## poly(start, 2)2 -2.14839    0.41249  -5.208  1.9e-07 ***
## date             0.09062    0.03061   2.960  0.00307 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 230.75  on 159  degrees of freedom
## Residual deviance: 169.46  on 153  degrees of freedom
## AIC: 797.58
## 
## Number of Fisher Scoring iterations: 4

Selected farm model:

  • SR ~ elevation + slope + farm_type + poly(start, 2) + date
  • family = poisson

Forest

m0fst <- glm(formula = fffst, 
             family = poisson,
             data = fstSR)
AER::dispersiontest(m0fst) # ok
## 
##  Overdispersion test
## 
## data:  m0fst
## z = -2.787, p-value = 0.9973
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.7353065
summary(m0fst)
## 
## Call:
## glm(formula = fffst, family = poisson, data = fstSR)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.61736  -0.58497   0.09967   0.52791   2.02748  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.888824   0.111317  16.968  < 2e-16 ***
## elevation        0.074737   0.037069   2.016   0.0438 *  
## slope           -0.045704   0.034687  -1.318   0.1876    
## forest_type1     0.166732   0.089881   1.855   0.0636 .  
## fr_dis           0.024141   0.039619   0.609   0.5423    
## hli             -0.023714   0.033323  -0.712   0.4767    
## poly(start, 2)1 -0.699798   0.443292  -1.579   0.1144    
## poly(start, 2)2 -0.565145   0.433450  -1.304   0.1923    
## date             0.183204   0.036278   5.050 4.42e-07 ***
## n_observersTRUE -0.079474   0.096839  -0.821   0.4118    
## visibilityTRUE   0.002486   0.095980   0.026   0.9793    
## cloudTRUE       -0.029178   0.071446  -0.408   0.6830    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 163.63  on 129  degrees of freedom
## Residual deviance: 101.82  on 118  degrees of freedom
## AIC: 617.33
## 
## Number of Fisher Scoring iterations: 4
m0fst.s <- MASS::stepAIC(m0fst)
## Start:  AIC=617.33
## SR ~ elevation + slope + forest_type + fr_dis + hli + poly(start, 
##     2) + date + n_observers + visibility + cloud
## 
##                  Df Deviance    AIC
## - visibility      1   101.82 615.33
## - cloud           1   101.98 615.50
## - fr_dis          1   102.19 615.70
## - hli             1   102.32 615.83
## - n_observers     1   102.50 616.01
## - slope           1   103.56 617.08
## <none>                101.82 617.33
## - poly(start, 2)  2   106.58 618.09
## - forest_type     1   105.29 618.80
## - elevation       1   105.86 619.38
## - date            1   127.68 641.19
## 
## Step:  AIC=615.33
## SR ~ elevation + slope + forest_type + fr_dis + hli + poly(start, 
##     2) + date + n_observers + cloud
## 
##                  Df Deviance    AIC
## - cloud           1   101.99 613.50
## - fr_dis          1   102.19 613.70
## - hli             1   102.32 613.84
## - n_observers     1   102.51 614.02
## - slope           1   103.57 615.09
## <none>                101.82 615.33
## - poly(start, 2)  2   106.68 616.20
## - forest_type     1   105.29 616.81
## - elevation       1   105.94 617.46
## - date            1   129.67 641.19
## 
## Step:  AIC=613.5
## SR ~ elevation + slope + forest_type + fr_dis + hli + poly(start, 
##     2) + date + n_observers
## 
##                  Df Deviance    AIC
## - fr_dis          1   102.30 611.81
## - hli             1   102.58 612.09
## - n_observers     1   102.80 612.32
## - slope           1   103.67 613.18
## <none>                101.99 613.50
## - poly(start, 2)  2   107.15 614.66
## - forest_type     1   105.38 614.89
## - elevation       1   107.01 616.52
## - date            1   129.98 639.50
## 
## Step:  AIC=611.81
## SR ~ elevation + slope + forest_type + hli + poly(start, 2) + 
##     date + n_observers
## 
##                  Df Deviance    AIC
## - hli             1   102.78 610.30
## - n_observers     1   103.04 610.56
## - slope           1   104.12 611.64
## <none>                102.30 611.81
## - poly(start, 2)  2   107.45 612.97
## - elevation       1   107.79 615.31
## - forest_type     1   107.82 615.34
## - date            1   130.50 638.01
## 
## Step:  AIC=610.3
## SR ~ elevation + slope + forest_type + poly(start, 2) + date + 
##     n_observers
## 
##                  Df Deviance    AIC
## - n_observers     1   103.62 609.13
## <none>                102.78 610.30
## - slope           1   104.90 610.41
## - poly(start, 2)  2   107.80 611.31
## - forest_type     1   108.00 613.51
## - elevation       1   108.56 614.08
## - date            1   130.77 636.29
## 
## Step:  AIC=609.13
## SR ~ elevation + slope + forest_type + poly(start, 2) + date
## 
##                  Df Deviance    AIC
## - slope           1   105.47 608.99
## <none>                103.62 609.13
## - poly(start, 2)  2   109.30 610.81
## - forest_type     1   108.46 611.97
## - elevation       1   110.09 613.60
## - date            1   132.10 635.61
## 
## Step:  AIC=608.99
## SR ~ elevation + forest_type + poly(start, 2) + date
## 
##                  Df Deviance    AIC
## <none>                105.47 608.99
## - forest_type     1   109.71 611.22
## - elevation       1   110.89 612.41
## - poly(start, 2)  2   112.90 612.41
## - date            1   133.60 635.11
AER::dispersiontest(m0fst.s) # ok
## 
##  Overdispersion test
## 
## data:  m0fst.s
## z = -2.4729, p-value = 0.9933
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.7631657
summary(m0fst.s)  
## 
## Call:
## glm(formula = SR ~ elevation + forest_type + poly(start, 2) + 
##     date, family = poisson, data = fstSR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3952  -0.6878   0.1326   0.5454   2.0767  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.86907    0.06657  28.079  < 2e-16 ***
## elevation        0.07926    0.03391   2.338   0.0194 *  
## forest_type1     0.16022    0.07863   2.038   0.0416 *  
## poly(start, 2)1 -0.85426    0.39031  -2.189   0.0286 *  
## poly(start, 2)2 -0.63185    0.41042  -1.540   0.1237    
## date             0.18423    0.03506   5.254 1.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 163.63  on 129  degrees of freedom
## Residual deviance: 105.47  on 124  degrees of freedom
## AIC: 608.99
## 
## Number of Fisher Scoring iterations: 4

Selected forest model:

  • SR ~ elevation + forest_type + poly(start, 2) + date
  • family = poisson

RF models

Random forest models, using the default parameters in the randomForest implementation, selecting variables with %IncMSE of < 5. Because this is a random process, we use 5 set.seeds, and average the results.

For the random forest models these need to remove the poly terms

# Full farm formula
fffrm2 <- formula(SR ~ 
                  elevation + slope + farm_type + sidi1ha + fl_dis + 
                  start + date + visibility)

# Full forest formula
fffst2 <- formula(SR ~ 
                  elevation + slope + forest_type + fr_dis + hli + 
                  start + date + n_observers + visibility + cloud)
myRF <- function(seed, .formula, .data){
  set.seed(seed)
  nme <- paste0("%IncMSE", seed)
  rf <- randomForest::randomForest(formula = .formula,
                                   data = .data, 
                                   importance = TRUE)
  imp <- randomForest::importance(rf) 
  
  imp %>% 
    as_tibble %>% 
    add_column(variable = dimnames(imp)[[1]], .before = 1) %>% 
    select(-IncNodePurity) %>% 
    arrange(variable) 
}

myRFset <- function(seeds, formula, data){
  map_dfc(seeds, ~myRF(.x, formula, data)) %>% 
    mutate(`mean%IncMSE` = rowMeans(across(where(is.numeric)))) %>% 
    select(variable = variable...1, `mean%IncMSE`) %>% 
    arrange(-`mean%IncMSE`)
}

Farm

myRFset(1:5, fffrm2, frmSR)
## # A tibble: 8 × 2
##   variable   `mean%IncMSE`
##   <chr>              <dbl>
## 1 date              14.6  
## 2 start              7.84 
## 3 fl_dis             7.66 
## 4 elevation          7.03 
## 5 farm_type          5.16 
## 6 sidi1ha            4.35 
## 7 slope              0.467
## 8 visibility        -0.819

RF selected farm variables: * [>5%IncMSE]: date, start, fl_dis, elevation, farm_type * [<5%IncMSE]: sidi1ha * [<1%IncMSE]: slope, visibility

Forest

myRFset(1:5, fffst2, fstSR)
## # A tibble: 10 × 2
##    variable    `mean%IncMSE`
##    <chr>               <dbl>
##  1 date               31.7  
##  2 elevation           9.24 
##  3 forest_type         5.45 
##  4 fr_dis              5.43 
##  5 hli                 4.29 
##  6 start               2.47 
##  7 n_observers         1.93 
##  8 cloud               0.865
##  9 visibility          0.566
## 10 slope              -1.08

RF selected forest variables: * [>5%IncMSE]: date, elevation, fr_dis, forest_type * [<5%IncMSE]: hli, start, twi, cloud, * [<1%IncMSE]: n_observers, visibility, slope

RF pre-selected GLM —-

Starting from the set of variables selected in the RF, conduct backwards step selection using AIC. Initial trial with poisson with log link (glm(family=poisson)) moving to negative binomial (glmmTMB(family=negbin2)) if overdispersed.

# Full farm formula
fffrm3 <- formula(SR ~ 
                  elevation + fl_dis + farm_type +
                  date + poly(start,2))

# Full forest formula
fffst3 <- formula(SR ~ 
                  elevation + fr_dis + forest_type + 
                  date)

Farm GLM

m2frm <- glm(formula = fffrm3, 
             family = poisson,
             data = frmSR)
AER::dispersiontest(m2frm) # ok
## 
##  Overdispersion test
## 
## data:  m2frm
## z = 0.25885, p-value = 0.3979
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.028516
summary(m2frm)
## 
## Call:
## glm(formula = fffrm3, family = poisson, data = frmSR)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.85952  -0.73446  -0.05434   0.63243   2.55008  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.27035    0.07288  31.151  < 2e-16 ***
## elevation       -0.05683    0.03208  -1.772  0.07645 .  
## fl_dis          -0.04619    0.03204  -1.442  0.14942    
## farm_type1      -0.25130    0.07928  -3.170  0.00153 ** 
## date             0.09112    0.03065   2.973  0.00295 ** 
## poly(start, 2)1  0.35208    0.38989   0.903  0.36651    
## poly(start, 2)2 -2.08370    0.41036  -5.078 3.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 230.75  on 159  degrees of freedom
## Residual deviance: 169.78  on 153  degrees of freedom
## AIC: 797.89
## 
## Number of Fisher Scoring iterations: 4
m2frm.s <- MASS::stepAIC(m2frm)
## Start:  AIC=797.89
## SR ~ elevation + fl_dis + farm_type + date + poly(start, 2)
## 
##                  Df Deviance    AIC
## <none>                169.78 797.89
## - fl_dis          1   171.86 797.98
## - elevation       1   172.93 799.04
## - date            1   178.69 804.81
## - farm_type       1   179.29 805.41
## - poly(start, 2)  2   197.66 821.77
AER::dispersiontest(m2frm.s) # ok
## 
##  Overdispersion test
## 
## data:  m2frm.s
## z = 0.25885, p-value = 0.3979
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.028516
summary(m2frm.s)  
## 
## Call:
## glm(formula = SR ~ elevation + fl_dis + farm_type + date + poly(start, 
##     2), family = poisson, data = frmSR)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.85952  -0.73446  -0.05434   0.63243   2.55008  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.27035    0.07288  31.151  < 2e-16 ***
## elevation       -0.05683    0.03208  -1.772  0.07645 .  
## fl_dis          -0.04619    0.03204  -1.442  0.14942    
## farm_type1      -0.25130    0.07928  -3.170  0.00153 ** 
## date             0.09112    0.03065   2.973  0.00295 ** 
## poly(start, 2)1  0.35208    0.38989   0.903  0.36651    
## poly(start, 2)2 -2.08370    0.41036  -5.078 3.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 230.75  on 159  degrees of freedom
## Residual deviance: 169.78  on 153  degrees of freedom
## AIC: 797.89
## 
## Number of Fisher Scoring iterations: 4

Selected farm model:

  • SR ~ elevation + fl_dis + farm_type + date + poly(start,2)
  • family = poisson

Forest

m2fst <- glm(formula = fffst3, 
             family = poisson,
             data = fstSR)
AER::dispersiontest(m2fst) # ok
## 
##  Overdispersion test
## 
## data:  m2fst
## z = -1.7812, p-value = 0.9626
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8198381
summary(m2fst)
## 
## Call:
## glm(formula = fffst3, family = poisson, data = fstSR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3850  -0.7216   0.1520   0.5802   2.2257  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.88370    0.07261  25.942  < 2e-16 ***
## elevation     0.06747    0.03380   1.996   0.0459 *  
## fr_dis        0.01849    0.03809   0.485   0.6274    
## forest_type1  0.14425    0.08892   1.622   0.1048    
## date          0.18867    0.03440   5.484 4.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 163.63  on 129  degrees of freedom
## Residual deviance: 112.66  on 125  degrees of freedom
## AIC: 614.18
## 
## Number of Fisher Scoring iterations: 4
m2fst.s <- MASS::stepAIC(m2fst)
## Start:  AIC=614.18
## SR ~ elevation + fr_dis + forest_type + date
## 
##               Df Deviance    AIC
## - fr_dis       1   112.90 612.41
## <none>             112.66 614.18
## - forest_type  1   115.32 614.83
## - elevation    1   116.61 616.12
## - date         1   143.16 642.68
## 
## Step:  AIC=612.41
## SR ~ elevation + forest_type + date
## 
##               Df Deviance    AIC
## <none>             112.90 612.41
## - elevation    1   117.10 614.62
## - forest_type  1   117.44 614.96
## - date         1   143.89 641.40
AER::dispersiontest(m2fst.s) # ok
## 
##  Overdispersion test
## 
## data:  m2fst.s
## z = -1.783, p-value = 0.9627
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8208674
summary(m2fst.s)  
## 
## Call:
## glm(formula = SR ~ elevation + forest_type + date, family = poisson, 
##     data = fstSR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3554  -0.7543   0.1550   0.6110   2.1909  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.86928    0.06635  28.172  < 2e-16 ***
## elevation     0.06910    0.03356   2.059   0.0395 *  
## forest_type1  0.16484    0.07810   2.111   0.0348 *  
## date          0.18958    0.03432   5.524 3.32e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 163.63  on 129  degrees of freedom
## Residual deviance: 112.90  on 126  degrees of freedom
## AIC: 612.41
## 
## Number of Fisher Scoring iterations: 4

Selected forest model:

  • SR ~ elevation + forest_type + date
  • family = poisson

MuMIn importance

Starting from the full model, use dredge to identify the top 10% of models. Conduct importance on these, selecting variables in at least 1/3 of the top models.

Farm

d0frm <- MuMIn::dredge(m0frm)
## Fixed term is "(Intercept)"
d2frm <- d0frm[i=1:(0.1*nrow(d0frm)), recalc.weights = TRUE, recalc.delta = FALSE]
MuMIn::sw(d2frm)
##                      date farm_type poly(start, 2) elevation fl_dis sidi1ha
## Sum of weights:      1.00 1.00      1.00           0.70      0.62   0.49   
## N containing models:   25   25        25             16        16     13   
##                      slope visibility
## Sum of weights:      0.44  0.30      
## N containing models:   12    12
nrow(d2frm)/3
## [1] 8.333333

Farm variables retained: date farm_type poly(start, 2) elevation fl_dis sidi1ha slope visibility

Forest

d0fst <- MuMIn::dredge(m0fst)
## Fixed term is "(Intercept)"
d2fst <- d0fst[i=1:(0.1*nrow(d0fst)), recalc.weights = TRUE, recalc.delta = FALSE]
MuMIn::sw(d2fst)
##                      date elevation forest_type poly(start, 2) slope
## Sum of weights:      1.00 0.90      0.81        0.76           0.47 
## N containing models:  102   83        77          70             49 
##                      n_observers fr_dis hli  cloud visibility
## Sum of weights:      0.26        0.25   0.23 0.21  0.16      
## N containing models:   31          32     29   29    21
nrow(d2fst)/3
## [1] 34

Forest variables retained: date elevation forest_type poly(start, 2) slope

Summary

Selecting variables for the multi-species occupancy model based on their relationships with species richness is a pragmatic approach. Some of the relationships seen in the species richness may not translate to individual species, but due to the low number of observations, we do need to reduce the number of variables we are estimating.

Farm variables selected by the different approaches:

suggest using:

Forest variables selected by the different approaches:

In the latter, n-observers was also one of the last de-selected. One to consider.

suggest using: